Let’s start with reading the computed metrics for all projects.
## [1] TRUE
## 'data.frame': 2835 obs. of 22 variables:
## $ project : Factor w/ 13 levels "black","cookiecutter",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bug_number : int 1 3 4 6 7 8 10 11 14 15 ...
## $ granularity : Factor w/ 3 levels "function","statement",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ technique : Factor w/ 7 levels "DStar","Metallaxis",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ crashing : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ predicate : logi FALSE FALSE FALSE TRUE TRUE TRUE ...
## $ ismutable : logi FALSE FALSE FALSE TRUE TRUE TRUE ...
## $ mutability : num 0 0 0 0.112 0.119 ...
## $ time : num 132.4 104.2 68.5 58.8 64.8 ...
## $ einspect : num 4 100.5 39.5 11 29 ...
## $ is_bug_localized : int 1 1 1 1 1 1 1 1 1 1 ...
## $ exam : num 0.0099 0.3018 0.1282 0.0364 0.0967 ...
## $ java_exam_score : num 0.0099 0.3018 0.1282 0.0364 0.0967 ...
## $ output_length : int 265 197 188 182 180 180 175 174 166 171 ...
## $ cdist : num NA NA NA NA NA NA NA NA NA NA ...
## $ svcomp : num NA NA NA NA NA NA NA NA NA NA ...
## $ cumulative_distance2: num NA NA NA NA NA NA NA NA NA NA ...
## $ minutes : num 2.21 1.74 1.14 0.98 1.08 ...
## $ logtime : num 4.89 4.65 4.23 4.07 4.17 ...
## $ family : Factor w/ 4 levels "MBFL","PS","ST",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ category : Factor w/ 4 levels "CL","DEV","DS",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ bugid : Factor w/ 135 levels "black1","black10",..: 1 9 10 11 12 13 2 3 4 5 ...
We have data about 135 bugs in 13 analyzed projects.
Let’s see an example of visual and statistical comparison of two groups of experiments for the same bugs.
To make the example concrete, let’s pick two groups and compare their \(E_{\text{inspect}}\) scores on statement-level fault localization:
Since there are three experiments per bug using SBFL, but only two per bug using MBFL, we’ll aggregate the scores for the same bug by averaging them.
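As a concrete sketch of this aggregation step (the little data frame here is made up; the real analysis works on the loaded experiments data), base R’s `aggregate` can compute the per-bug means:

```r
# Hypothetical mini data frame mimicking the real one (bugid, family, einspect).
scores <- data.frame(
  bugid    = rep("black1", 5),
  family   = c("SBFL", "SBFL", "SBFL", "MBFL", "MBFL"),
  einspect = c(4, 6, 5, 10, 12)
)
# One averaged einspect score per (bug, family) pair.
by.bug <- aggregate(einspect ~ bugid + family, data = scores, FUN = mean)
```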
Let’s start with some visualization: a scatterplot with a point for each bug; each point has coordinates \(x, y\) where \(x\) is its score in MBFL and \(y\) its score in SBFL.
As you can see, there is a bulk of bugs for which SBFL performs very similarly to MBFL (points close to the \(x = y\) straight line). However, for several other bugs, SBFL is much better (remember that lower is better for this score).
Looking at the colors, we notice that several bugs in the CL (and possibly DS) category are overrepresented among the “harder” bugs on which SBFL behaves much better than MBFL.
Analyzing the same data numerically, we can compute the correlation (Kendall’s \(\tau\)) between \(S\) and \(M\):
##
## Kendall's rank correlation tau
##
## data: S and M
## z = 7.8047, p-value = 5.965e-15
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.5403952
A correlation of about 0.54 is not particularly strong, but it is clearly positive (note the minuscule p-value).
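The code that produced the output above is not echoed; a minimal sketch with base R’s `cor.test` looks as follows (the paired vectors `S` and `M` here are made up, standing in for the per-bug SBFL and MBFL scores):

```r
# Made-up paired scores; in the real analysis, S and M hold the per-bug
# averaged einspect scores of SBFL and MBFL respectively.
S <- c(1, 2, 3, 5, 8)
M <- c(2, 1, 4, 6, 10)
kt <- cor.test(S, M, method = "kendall")  # Kendall's rank correlation
kt$estimate  # sample tau
```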
Finally, we may also perform a statistical test (Wilcoxon’s paired test) and compute a matching effect size (Cliff’s delta).
##
## Wilcoxon signed rank test with continuity correction
##
## data: S and M
## V = 1068, p-value = 0.005209
## alternative hypothesis: true location shift is not equal to 0
##
## Cliff's Delta
##
## delta estimate: -0.1761866 (small)
## 95 percent confidence interval:
## lower upper
## -0.29548868 -0.05147369
Cliff’s delta, in particular, roughly measures how often the values in one set are larger than the values in the other set. More precisely, it is the probability that a value from the first set exceeds one from the second, minus the probability of the opposite. Thus, the estimate of about −0.18 means that SBFL’s \(E_{\text{inspect}}\) score is smaller than MBFL’s roughly 18 percentage points more often than it is larger.
These statistics, for what they’re worth, seem to confirm that there is a noticeable difference in favor of SBFL.
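Again, the code behind the output above is not echoed; here is a hedged sketch of both computations on made-up paired vectors, with Cliff’s delta implemented from first principles rather than via the `effsize` package:

```r
S <- c(1, 2, 3, 5, 8)   # made-up scores standing in for SBFL
M <- c(2, 4, 6, 9, 13)  # made-up scores standing in for MBFL
wilcox.test(S, M, paired = TRUE)  # Wilcoxon signed rank test

# Cliff's delta: proportion of (x, y) pairs with x > y minus those with x < y.
cliffs.delta <- function(x, y) mean(sign(outer(x, y, `-`)))
delta <- cliffs.delta(S, M)
```

With these values, `delta` is −0.48: S is smaller than M far more often than the other way around, mirroring (in exaggerated form) the direction of the result above.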
Now, let’s generalize this to a scatterplot matrix to show the relations between all possible pairs of FL families.
First, we define a bunch of helper functions.
Then, we use them to generate plots for \(E\).
Now, it’s easy to compute a similar plot for other metrics. For example, running time (in minutes):
And also by technique within the SBFL and MBFL families (the families that include more than one technique).
Let’s build a simple multivariate regression model,
where we predict einspect and logtime from the family of the FL technique and the category of the project.
Notice that we found it preferable to log-transform time (in seconds),
since this helps with the wide range of variability of running times among techniques.
In particular, ST runs in a matter of seconds, two orders of magnitude
faster than the next fastest family, SBFL. If we do not
log-transform time, we still get generally sensible results,
but the advantage of ST over SBFL becomes watered down
and less clear. Thus, we stick with the log-transformed time.
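To see why the log-transform helps, consider a few illustrative running times (the numbers are made up; only the orders of magnitude mirror the discussion above):

```r
# Illustrative running times in seconds: an ST-like, an SBFL-like, and an
# MBFL-like value (made up, spanning four orders of magnitude).
times <- c(2, 200, 20000)
log(times)  # the log scale compresses the range to comparable magnitudes
```

On the log scale, the four-orders-of-magnitude spread shrinks to a difference of about 9.2, which a linear model can accommodate much more gracefully.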
Before proceeding with fitting, we standardize both outcome variables, so that it’s much easier to set sensible priors.
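The standardization code is not shown; a plausible sketch (an assumption, not necessarily the exact scaling used) is to rescale by the standard deviation, which keeps the positive scores positive so that the log link used below remains sensible:

```r
# Hypothetical standardization helper; the S suffix in einspectS and timeS
# presumably denotes the standardized versions of einspect and time.
standardize <- function(x) x / sd(x)
standardize(c(2, 4, 6))  # returns 1 2 3 (the sample sd is 2)
```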
Here’s a basic regression model; the only unusual aspects are that it’s multivariate and uses a log link function.
```r
eq.m1 <- brmsformula(
  mvbind(einspectS, timeS) ~ 0 + family + category,
  family=brmsfamily("gaussian", link="log")
) + set_rescor(TRUE)
pp1.check <- get_prior(eq.m1, data=by.statement)
pp1 <- c(
  set_prior("normal(0, 1.0)", class="b", resp=c("einspectS", "timeS")),
  set_prior("weibull(2, 1)", class="sigma", resp=c("einspectS", "timeS"))
)
```
Let’s do the usual checks to make sure that everything is fine with the fitting.
Prior checks, confirming that the sampled priors span a wide range of values, amply including the data.
Now we fit the actual model.
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 1 finished in 10.1 seconds.
## Chain 3 finished in 10.0 seconds.
## Chain 4 finished in 10.1 seconds.
## Chain 2 finished in 10.9 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 10.3 seconds.
## Total execution time: 11.2 seconds.
Next, we check the usual diagnostics (number of divergent transitions, maximum Rhat, and minimum relative effective sample size):
## [1] 0
## [1] 1.003669
## [1] 0.3695669
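As a reminder of what Rhat (the potential scale reduction factor reported among these diagnostics) measures, here is a minimal, non-split sketch; brms additionally splits each chain in half before computing it:

```r
# draws: an iterations x chains matrix of posterior draws for one parameter.
basic.rhat <- function(draws) {
  n <- nrow(draws)
  B <- n * var(colMeans(draws))        # between-chain variance
  W <- mean(apply(draws, 2, var))      # mean within-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)  # potential scale reduction factor
}
```

Chains that explore the same distribution give values very close to 1, as in the fit above.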
Finally, we check the posteriors, to ensure that we have a decent approximation of the data.
As you can see, the simulated posteriors are a decent approximation, considering that the data is complex while the model is quite simplistic (we’ll improve it soon).
## Family: MV(gaussian, gaussian)
## Links: mu = log; sigma = identity
## mu = log; sigma = identity
## Formula: einspectS ~ 0 + family + category
## timeS ~ 0 + family + category
## Data: by.statement (Number of observations: 945)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## einspectS_familyMBFL -2.96 0.49 -3.98 -2.09 1.00 6679
## einspectS_familyPS 0.43 0.08 0.27 0.58 1.00 5758
## einspectS_familyST 0.54 0.08 0.37 0.69 1.00 4489
## einspectS_familySBFL -3.39 0.47 -4.41 -2.56 1.00 5321
## einspectS_categoryDEV -2.52 0.49 -3.63 -1.71 1.00 4888
## einspectS_categoryDS -0.68 0.13 -0.92 -0.44 1.00 5069
## einspectS_categoryWEB -2.29 0.47 -3.34 -1.50 1.00 5021
## timeS_familyMBFL -0.60 0.14 -0.90 -0.35 1.00 1671
## timeS_familyPS -1.26 0.18 -1.62 -0.93 1.00 2571
## timeS_familyST -4.60 0.45 -5.55 -3.79 1.00 5776
## timeS_familySBFL -4.12 0.44 -5.08 -3.34 1.00 5346
## timeS_categoryDEV 0.64 0.17 0.32 0.98 1.00 1960
## timeS_categoryDS 0.86 0.15 0.58 1.18 1.00 1726
## timeS_categoryWEB 0.33 0.21 -0.10 0.73 1.00 2433
## Tail_ESS
## einspectS_familyMBFL 2906
## einspectS_familyPS 3133
## einspectS_familyST 3083
## einspectS_familySBFL 2467
## einspectS_categoryDEV 2175
## einspectS_categoryDS 3115
## einspectS_categoryWEB 2318
## timeS_familyMBFL 2202
## timeS_familyPS 3098
## timeS_familyST 2816
## timeS_familySBFL 2223
## timeS_categoryDEV 2613
## timeS_categoryDS 2312
## timeS_categoryWEB 3071
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_einspectS 0.86 0.02 0.82 0.90 1.00 6494 3213
## sigma_timeS 0.85 0.02 0.81 0.88 1.00 5999 2868
##
## Residual Correlations:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## rescor(einspectS,timeS) 0.10 0.04 0.03 0.18 1.00 5418
## Tail_ESS
## rescor(einspectS,timeS) 3301
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
What’s noticeable here is that the residual correlation
between the two outcomes einspect and time is smallish (10%),
which means that there is not much of a consistent dependency
between these two variables.
Let’s set up some functions to analyze the posterior samples of \(m_1\) (and similar models).
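The helper functions themselves are not echoed; presumably the `$ints` tables they produce list, for each coefficient, the lower (`|p`) and upper (`p|`) bounds of central credible intervals at increasing probability levels p. A hedged sketch of such a summary for one vector of posterior draws:

```r
# Lower (`|p`) and upper (`p|`) bounds of central intervals from draws.
interval.table <- function(draws, levels = c(0.5, 0.7, 0.9, 0.95, 0.99)) {
  lo <- sapply(levels, function(p) quantile(draws, (1 - p) / 2, names = FALSE))
  hi <- sapply(rev(levels),
               function(p) quantile(draws, 1 - (1 - p) / 2, names = FALSE))
  setNames(c(lo, hi), c(paste0("|", levels), paste0(rev(levels), "|")))
}
```

Applied to each column of posterior draws, this yields ten bounds per coefficient, matching the row labels in the tables below.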
Let’s use these functions to first analyze the effects per family of FL techniques.
## $ints
## MBFL PS SBFL ST
## |0.5 -1.8353102 1.739195 -2.3158802 1.827165
## |0.7 -2.0553602 1.711112 -2.4126302 1.804245
## |0.9 -2.3861902 1.654794 -2.8586302 1.751850
## |0.95 -2.5145002 1.624174 -2.9680702 1.720207
## |0.99 -2.9997502 1.549535 -3.2901202 1.673284
## 0.99| -0.5121802 1.952189 -0.9237502 2.096772
## 0.95| -0.6458902 1.927664 -1.1574802 2.037926
## 0.9| -0.8250402 1.910703 -1.3378402 2.020518
## 0.7| -1.0481902 1.869024 -1.4724602 1.972580
## 0.5| -1.1912202 1.842191 -1.7029002 1.937703
##
## $est
## NULL
## $ints
## MBFL PS SBFL ST
## |0.5 1.960318 1.2660575 -1.6757025 -2.1966025
## |0.7 1.939706 1.2091175 -1.8622325 -2.3645725
## |0.9 1.830394 1.0957975 -2.1541725 -2.6666025
## |0.95 1.757469 1.0383075 -2.3414025 -2.9301925
## |0.99 1.681878 0.9058275 -2.7163425 -3.1681425
## 0.99| 2.409543 1.8271925 -0.4405425 -0.8711025
## 0.95| 2.307460 1.7190265 -0.6362325 -1.1655825
## 0.9| 2.284562 1.6675425 -0.7555525 -1.2291625
## 0.7| 2.225164 1.5814775 -0.9866125 -1.4500425
## 0.5| 2.150519 1.5101775 -1.1078825 -1.6046025
##
## $est
## NULL
For both outcomes, einspect and time,
there are clear differences (with high probability)
in the contributions of the different families of techniques to the mean.
Looking at the effects by category of project does not
yield equally strong differences, but we can see that DS projects tend to be associated with worse (higher) einspect.
## $ints
## DEV DS WEB
## |0.5 -0.91667549 1.0764985 -0.62714549
## |0.7 -1.05566549 1.0244915 -0.86156549
## |0.9 -1.45905549 0.9409025 -1.16507549
## |0.95 -1.68509549 0.9035745 -1.39217549
## |0.99 -2.07679549 0.8313065 -1.80509549
## 0.99| 0.35950451 1.4586825 0.51750451
## 0.95| 0.18894451 1.3829805 0.39730451
## 0.9| 0.09513451 1.3503185 0.33071451
## 0.7| -0.10926549 1.2855685 0.09100451
## 0.5| -0.28688549 1.2429085 -0.01610549
##
## $est
## NULL
## $ints
## DEV DS WEB
## |0.5 -0.08602014 0.150728859 -0.40916514
## |0.7 -0.14248514 0.087022859 -0.47240714
## |0.9 -0.24966114 -0.004000141 -0.62202694
## |0.95 -0.29821414 -0.026865141 -0.69047224
## |0.99 -0.37835714 -0.119299141 -0.87612214
## 0.99| 0.48681286 0.667462859 0.22649886
## 0.95| 0.35430886 0.565192859 0.14131586
## 0.9| 0.28288386 0.487022859 0.07624786
## 0.7| 0.19699386 0.395402859 -0.04057914
## 0.5| 0.13734386 0.352488859 -0.13661914
##
## $est
## NULL
Let’s make the model more sophisticated, with varying effects, modeling these effects as possibly correlated (which makes sense, since the model has two outcome parts).
```r
eq.m2 <- brmsformula(
  mvbind(einspectS, timeS) ~ 1 + (1|p|family) + (1|q|category),
  family=brmsfamily("gaussian", link="log")
) + set_rescor(TRUE)
pp2.check <- get_prior(eq.m2, data=by.statement)
pp2 <- c(
  set_prior("normal(0, 1.0)", class="Intercept", resp=c("einspectS", "timeS")),
  set_prior("weibull(2, 0.3)", class="sd", coef="Intercept",
            group="family", resp=c("einspectS", "timeS")),
  set_prior("weibull(2, 0.3)", class="sd", coef="Intercept",
            group="category", resp=c("einspectS", "timeS")),
  set_prior("gamma(0.01, 0.01)", class="sigma", resp=c("einspectS", "timeS"))
)
```
Let’s fit \(m_2\) and check the fit.
Prior checks:
We fit model \(m_2\).
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 2 finished in 68.7 seconds.
## Chain 4 finished in 69.2 seconds.
## Chain 1 finished in 71.5 seconds.
## Chain 3 finished in 73.2 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 70.7 seconds.
## Total execution time: 73.4 seconds.
Diagnostics:
## [1] 0
## [1] 1.003221
## [1] 0.3363845
Posterior checks:
We can perhaps glean a small improvement compared to \(m_1\). Let’s compare the two models using LOO.
## Output of model 'm1':
##
## Computed from 4000 by 945 log-likelihood matrix
##
## Estimate SE
## elpd_loo -2386.0 63.7
## p_loo 28.2 3.9
## looic 4772.0 127.3
## ------
## Monte Carlo SE of elpd_loo is 0.1.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
##
## Output of model 'm2':
##
## Computed from 4000 by 945 log-likelihood matrix
##
## Estimate SE
## elpd_loo -2383.1 63.9
## p_loo 28.7 4.0
## looic 4766.1 127.7
## ------
## Monte Carlo SE of elpd_loo is 0.1.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
##
## Model comparisons:
## elpd_diff se_diff
## m2 0.0 0.0
## m1 -2.9 1.1
\(m_1\)’s score is more than 2.6 standard deviations worse than \(m_2\)’s, which is a significant difference in favor of \(m_2\) in terms of predictive capabilities.
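As a quick sanity check of the “more than 2.6 standard deviations” figure, using the numbers from the comparison table above:

```r
elpd_diff <- -2.9  # m1 relative to m2, from the LOO comparison above
se_diff   <- 1.1
abs(elpd_diff) / se_diff  # about 2.64 standard errors
```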
## Family: MV(gaussian, gaussian)
## Links: mu = log; sigma = identity
## mu = log; sigma = identity
## Formula: einspectS ~ 1 + (1 | p | family) + (1 | q | category)
## timeS ~ 1 + (1 | p | family) + (1 | q | category)
## Data: by.statement (Number of observations: 945)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Group-Level Effects:
## ~category (Number of levels: 4)
## Estimate Est.Error l-95% CI u-95% CI
## sd(einspectS_Intercept) 0.65 0.12 0.44 0.90
## sd(timeS_Intercept) 0.39 0.10 0.22 0.63
## cor(einspectS_Intercept,timeS_Intercept) -0.21 0.27 -0.68 0.33
## Rhat Bulk_ESS Tail_ESS
## sd(einspectS_Intercept) 1.00 4409 3005
## sd(timeS_Intercept) 1.00 3590 2960
## cor(einspectS_Intercept,timeS_Intercept) 1.00 3043 2677
##
## ~family (Number of levels: 4)
## Estimate Est.Error l-95% CI u-95% CI
## sd(einspectS_Intercept) 0.91 0.12 0.69 1.17
## sd(timeS_Intercept) 0.92 0.12 0.69 1.18
## cor(einspectS_Intercept,timeS_Intercept) -0.04 0.18 -0.37 0.31
## Rhat Bulk_ESS Tail_ESS
## sd(einspectS_Intercept) 1.00 3766 2761
## sd(timeS_Intercept) 1.00 4365 2681
## cor(einspectS_Intercept,timeS_Intercept) 1.00 3364 2659
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## einspectS_Intercept -2.08 0.53 -3.10 -1.06 1.00 1975 1927
## timeS_Intercept -1.97 0.48 -2.90 -1.03 1.00 2581 2667
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_einspectS 0.86 0.02 0.82 0.90 1.00 6091 2669
## sigma_timeS 0.84 0.02 0.80 0.88 1.00 5068 2374
##
## Residual Correlations:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## rescor(einspectS,timeS) 0.10 0.04 0.02 0.17 1.00 5481
## Tail_ESS
## rescor(einspectS,timeS) 2502
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
By category, there is a slight inverse correlation between
the two outcomes einspect and time; this correlation disappears
if we look at the family terms.
The residual correlation is the same as in \(m_1\).
Let’s now perform an effects analysis
on the fitted coefficients of \(m_2\).
First we introduce a summary function suitable for varying effects models.
Then we use the summary function to analyze the effects of the FL techniques.
## $ints
## MBFL PS ST SBFL
## |0.5 -2.5262775 1.1176850 1.2182375 -2.8801300
## |0.7 -2.7709055 0.9477579 1.0616545 -3.1258475
## |0.9 -3.2180915 0.6775959 0.7788652 -3.5455925
## |0.95 -3.5103197 0.5359255 0.6447676 -3.7660067
## |0.99 -4.0641862 0.2329609 0.3315840 -4.2131177
## 0.99| -0.7022095 2.5920590 2.7000748 -0.9786326
## 0.95| -1.0036095 2.3085197 2.4006195 -1.3338787
## 0.9| -1.1621255 2.1533480 2.2573620 -1.5147860
## 0.7| -1.4888745 1.8624790 1.9778870 -1.8374510
## 0.5| -1.6824225 1.7035350 1.8101250 -2.0364775
##
## $est
## MBFL PS ST SBFL
## -2.132916 1.411442 1.518747 -2.473152
## $ints
## MBFL PS ST SBFL
## |0.5 1.5663375 0.8773568 -2.946875 -2.4867300
## |0.7 1.4073460 0.7103954 -3.209426 -2.7332090
## |0.9 1.1276775 0.4251590 -3.687169 -3.1826010
## |0.95 0.9677368 0.2451090 -3.925625 -3.4541217
## |0.99 0.6264373 -0.0473359 -4.352294 -4.0203258
## 0.99| 2.9782522 2.3132561 -1.073653 -0.5231171
## 0.95| 2.7062393 2.0606435 -1.438210 -0.8640560
## 0.9| 2.5780990 1.9228000 -1.598879 -1.0378780
## 0.7| 2.3357665 1.6489455 -1.930957 -1.4056300
## 0.5| 2.1721100 1.4788225 -2.124840 -1.6318900
##
## $est
## MBFL PS ST SBFL
## 1.862529 1.174628 -2.564445 -2.069716
The results are generally consistent with those of model \(m_1\), although some effects slightly weaken or strengthen.
Let’s see what happens for the bug/category of projects.
## $ints
## CL DEV DS WEB
## |0.5 0.8576515 -1.5181075 0.16223450 -1.33094250
## |0.7 0.7378781 -1.6876235 0.04688615 -1.50825150
## |0.9 0.5309405 -2.0290295 -0.18046820 -1.89460300
## |0.95 0.4075303 -2.2206682 -0.31008255 -2.12356500
## |0.99 0.1732455 -2.6481052 -0.55104856 -2.55194350
## 0.99| 2.0725035 -0.1770082 1.36204860 0.04386895
## 0.95| 1.7999425 -0.4056425 1.12326375 -0.19342832
## 0.9| 1.6730295 -0.5240378 0.99301000 -0.31669545
## 0.7| 1.4488815 -0.7515989 0.75661670 -0.56242150
## 0.5| 1.3150325 -0.9015535 0.62102200 -0.70125675
##
## $est
## CL DEV DS WEB
## 1.0899108 -1.2253783 0.3930164 -1.0401307
## $ints
## CL DEV DS WEB
## |0.5 -0.8226835 0.03591773 0.22997525 -0.23261775
## |0.7 -0.9169421 -0.05039076 0.15723225 -0.32312575
## |0.9 -1.0974810 -0.19879025 0.02132240 -0.48009930
## |0.95 -1.1933780 -0.28385260 -0.06325879 -0.57814627
## |0.99 -1.4628219 -0.48788514 -0.22869648 -0.75663510
## 0.99| -0.1105753 0.73849182 0.90394435 0.51016309
## 0.95| -0.2154603 0.59909808 0.76945853 0.35718410
## 0.9| -0.2763427 0.52126265 0.71423625 0.28251385
## 0.7| -0.4095806 0.39238235 0.57253150 0.14601540
## 0.5| -0.4943050 0.31617325 0.49809975 0.06830863
##
## $est
## CL DEV DS WEB
## -0.66390180 0.17212172 0.36407508 -0.08639932
Here we see some differences, which may partly be due to the fact that \(m_2\) models the different categories more uniformly. Furthermore, some changes may simply mean that the per-category effects are small, and hence likely to fluctuate with inconsequential changes to the model.
Now, let’s try a variant of \(m_2\) where we go back to
fixed effects, but add an interaction between the family of FL techniques
and the category of projects (in the form of family effects varying by category).
```r
eq.m3 <- brmsformula(
  mvbind(einspectS, timeS) ~
    0 + family + category + (0 + family|r|category),
  family=brmsfamily("gaussian", link="log")
) + set_rescor(TRUE)
pp3.check <- get_prior(eq.m3, data=by.statement)
pp3 <- c(
  set_prior("normal(0, 1.0)", class="b", resp=c("einspectS", "timeS")),
  set_prior("gamma(0.01, 0.01)", class="sigma", resp=c("einspectS", "timeS")),
  set_prior("lkj(1)", class="cor"),
  set_prior("weibull(2, 0.3)", class="sd", resp=c("einspectS", "timeS"))
)
```
Let’s fit \(m_3\) and check the fit.
Prior checks:
We fit model \(m_3\).
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 2 finished in 106.9 seconds.
## Chain 4 finished in 108.5 seconds.
## Chain 3 finished in 110.1 seconds.
## Chain 1 finished in 113.4 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 109.7 seconds.
## Total execution time: 113.6 seconds.
Diagnostics:
## [1] 0
## [1] 1.003169
## [1] 0.2236839
Posterior checks:
In line with what we saw before, and possibly a bit better.
Model comparison:
## Output of model 'm1':
##
## Computed from 4000 by 945 log-likelihood matrix
##
## Estimate SE
## elpd_loo -2386.0 63.7
## p_loo 28.2 3.9
## looic 4772.0 127.3
## ------
## Monte Carlo SE of elpd_loo is 0.1.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
##
## Output of model 'm2':
##
## Computed from 4000 by 945 log-likelihood matrix
##
## Estimate SE
## elpd_loo -2383.1 63.9
## p_loo 28.7 4.0
## looic 4766.1 127.7
## ------
## Monte Carlo SE of elpd_loo is 0.1.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
##
## Output of model 'm3':
##
## Computed from 4000 by 945 log-likelihood matrix
##
## Estimate SE
## elpd_loo -2379.3 64.3
## p_loo 32.1 4.2
## looic 4758.6 128.5
## ------
## Monte Carlo SE of elpd_loo is 0.1.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 944 99.9% 1049
## (0.5, 0.7] (ok) 1 0.1% 1209
## (0.7, 1] (bad) 0 0.0% <NA>
## (1, Inf) (very bad) 0 0.0% <NA>
##
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.
##
## Model comparisons:
## elpd_diff se_diff
## m3 0.0 0.0
## m2 -3.8 5.0
## m1 -6.7 4.9
\(m_2\)’s score is only 0.76 standard deviations worse than \(m_3\)’s. This is not a significant difference, and the improvement is not worth the additional complexity of model \(m_3\) (which is also harder to interpret). Thus, we stick with \(m_2\) as our selected model.
## Family: MV(gaussian, gaussian)
## Links: mu = log; sigma = identity
## mu = log; sigma = identity
## Formula: einspectS ~ 0 + family + category + (0 + family | r | category)
## timeS ~ 0 + family + category + (0 + family | r | category)
## Data: by.statement (Number of observations: 945)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Group-Level Effects:
## ~category (Number of levels: 4)
## Estimate Est.Error l-95% CI
## sd(einspectS_familyMBFL) 0.28 0.15 0.05
## sd(einspectS_familyPS) 0.28 0.14 0.05
## sd(einspectS_familyST) 0.31 0.14 0.08
## sd(einspectS_familySBFL) 0.29 0.15 0.05
## sd(timeS_familyMBFL) 0.27 0.13 0.05
## sd(timeS_familyPS) 0.41 0.15 0.12
## sd(timeS_familyST) 0.28 0.15 0.05
## sd(timeS_familySBFL) 0.30 0.16 0.06
## cor(einspectS_familyMBFL,einspectS_familyPS) -0.02 0.33 -0.63
## cor(einspectS_familyMBFL,einspectS_familyST) -0.04 0.33 -0.65
## cor(einspectS_familyPS,einspectS_familyST) 0.01 0.33 -0.62
## cor(einspectS_familyMBFL,einspectS_familySBFL) 0.05 0.33 -0.59
## cor(einspectS_familyPS,einspectS_familySBFL) -0.01 0.33 -0.64
## cor(einspectS_familyST,einspectS_familySBFL) -0.04 0.33 -0.64
## cor(einspectS_familyMBFL,timeS_familyMBFL) 0.05 0.33 -0.59
## cor(einspectS_familyPS,timeS_familyMBFL) -0.01 0.33 -0.64
## cor(einspectS_familyST,timeS_familyMBFL) -0.11 0.32 -0.69
## cor(einspectS_familySBFL,timeS_familyMBFL) 0.04 0.33 -0.59
## cor(einspectS_familyMBFL,timeS_familyPS) 0.07 0.32 -0.56
## cor(einspectS_familyPS,timeS_familyPS) 0.04 0.31 -0.59
## cor(einspectS_familyST,timeS_familyPS) -0.20 0.31 -0.74
## cor(einspectS_familySBFL,timeS_familyPS) 0.07 0.33 -0.59
## cor(timeS_familyMBFL,timeS_familyPS) 0.21 0.32 -0.47
## cor(einspectS_familyMBFL,timeS_familyST) 0.04 0.34 -0.59
## cor(einspectS_familyPS,timeS_familyST) -0.02 0.34 -0.66
## cor(einspectS_familyST,timeS_familyST) -0.02 0.34 -0.64
## cor(einspectS_familySBFL,timeS_familyST) 0.04 0.33 -0.60
## cor(timeS_familyMBFL,timeS_familyST) 0.03 0.33 -0.60
## cor(timeS_familyPS,timeS_familyST) 0.02 0.33 -0.61
## cor(einspectS_familyMBFL,timeS_familySBFL) 0.03 0.33 -0.60
## cor(einspectS_familyPS,timeS_familySBFL) 0.04 0.33 -0.59
## cor(einspectS_familyST,timeS_familySBFL) -0.06 0.33 -0.67
## cor(einspectS_familySBFL,timeS_familySBFL) 0.04 0.33 -0.59
## cor(timeS_familyMBFL,timeS_familySBFL) 0.06 0.33 -0.58
## cor(timeS_familyPS,timeS_familySBFL) 0.09 0.32 -0.56
## cor(timeS_familyST,timeS_familySBFL) 0.02 0.33 -0.62
## u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(einspectS_familyMBFL) 0.61 1.00 3928 2036
## sd(einspectS_familyPS) 0.58 1.00 2417 1754
## sd(einspectS_familyST) 0.60 1.00 2423 2105
## sd(einspectS_familySBFL) 0.61 1.00 3977 1706
## sd(timeS_familyMBFL) 0.55 1.00 1492 1845
## sd(timeS_familyPS) 0.72 1.00 1746 1509
## sd(timeS_familyST) 0.61 1.00 4241 2286
## sd(timeS_familySBFL) 0.65 1.00 4117 2494
## cor(einspectS_familyMBFL,einspectS_familyPS) 0.61 1.00 4241 2811
## cor(einspectS_familyMBFL,einspectS_familyST) 0.60 1.00 4230 2591
## cor(einspectS_familyPS,einspectS_familyST) 0.62 1.00 3842 3036
## cor(einspectS_familyMBFL,einspectS_familySBFL) 0.66 1.00 4839 2735
## cor(einspectS_familyPS,einspectS_familySBFL) 0.62 1.00 4683 2978
## cor(einspectS_familyST,einspectS_familySBFL) 0.60 1.00 4427 3008
## cor(einspectS_familyMBFL,timeS_familyMBFL) 0.67 1.00 3856 3030
## cor(einspectS_familyPS,timeS_familyMBFL) 0.63 1.00 3837 2863
## cor(einspectS_familyST,timeS_familyMBFL) 0.52 1.00 3216 3076
## cor(einspectS_familySBFL,timeS_familyMBFL) 0.66 1.00 3102 3318
## cor(einspectS_familyMBFL,timeS_familyPS) 0.66 1.00 2999 2592
## cor(einspectS_familyPS,timeS_familyPS) 0.63 1.00 3110 3105
## cor(einspectS_familyST,timeS_familyPS) 0.44 1.00 2478 2674
## cor(einspectS_familySBFL,timeS_familyPS) 0.65 1.00 2859 3036
## cor(timeS_familyMBFL,timeS_familyPS) 0.75 1.00 1552 3466
## cor(einspectS_familyMBFL,timeS_familyST) 0.66 1.00 6251 2864
## cor(einspectS_familyPS,timeS_familyST) 0.61 1.00 4846 2539
## cor(einspectS_familyST,timeS_familyST) 0.61 1.00 4305 2680
## cor(einspectS_familySBFL,timeS_familyST) 0.67 1.00 3225 3055
## cor(timeS_familyMBFL,timeS_familyST) 0.65 1.00 2745 3049
## cor(timeS_familyPS,timeS_familyST) 0.65 1.00 3142 3447
## cor(einspectS_familyMBFL,timeS_familySBFL) 0.64 1.00 5482 2861
## cor(einspectS_familyPS,timeS_familySBFL) 0.64 1.00 4110 2875
## cor(einspectS_familyST,timeS_familySBFL) 0.59 1.00 3926 3090
## cor(einspectS_familySBFL,timeS_familySBFL) 0.65 1.00 3827 2946
## cor(timeS_familyMBFL,timeS_familySBFL) 0.66 1.00 3139 3044
## cor(timeS_familyPS,timeS_familySBFL) 0.67 1.00 2962 3484
## cor(timeS_familyST,timeS_familySBFL) 0.64 1.00 2192 3228
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## einspectS_familyMBFL -2.91 0.54 -4.06 -1.92 1.00 4282
## einspectS_familyPS 0.36 0.27 -0.22 0.91 1.00 1613
## einspectS_familyST 0.21 0.32 -0.49 0.75 1.00 1696
## einspectS_familySBFL -3.35 0.54 -4.46 -2.32 1.00 3821
## einspectS_categoryDEV -2.39 0.55 -3.56 -1.39 1.00 3427
## einspectS_categoryDS -0.56 0.33 -1.25 0.07 1.00 1268
## einspectS_categoryWEB -2.19 0.54 -3.30 -1.22 1.00 3829
## timeS_familyMBFL -0.28 0.31 -0.80 0.39 1.00 960
## timeS_familyPS -1.01 0.37 -1.71 -0.23 1.00 1226
## timeS_familyST -4.29 0.51 -5.32 -3.30 1.00 2830
## timeS_familySBFL -3.85 0.53 -4.91 -2.83 1.00 3641
## timeS_categoryDEV 0.23 0.39 -0.64 0.90 1.00 962
## timeS_categoryDS 0.50 0.39 -0.39 1.15 1.00 1036
## timeS_categoryWEB -0.09 0.41 -1.01 0.59 1.00 1261
## Tail_ESS
## einspectS_familyMBFL 2990
## einspectS_familyPS 1867
## einspectS_familyST 2585
## einspectS_familySBFL 2637
## einspectS_categoryDEV 2776
## einspectS_categoryDS 2440
## einspectS_categoryWEB 2990
## timeS_familyMBFL 1521
## timeS_familyPS 1720
## timeS_familyST 2771
## timeS_familySBFL 2705
## timeS_categoryDEV 1728
## timeS_categoryDS 1608
## timeS_categoryWEB 1828
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_einspectS 0.85 0.02 0.81 0.89 1.00 7246 2646
## sigma_timeS 0.84 0.02 0.81 0.88 1.00 6615 2828
##
## Residual Correlations:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## rescor(einspectS,timeS) 0.08 0.04 0.01 0.16 1.00 6008
## Tail_ESS
## rescor(einspectS,timeS) 3210
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Instead of considering the fixed and varying effects of \(m_3\), we may estimate the marginal means for each family of FL techniques (results omitted for brevity, since we’ll focus on \(m_2\) anyway).
Let’s now add predictors to \(m_2\), so as to study any effect of the kinds of bugs:

- predicate is a Boolean value that identifies predicate-related bugs
- crashing is a Boolean value that identifies crashing bugs
- mutability is a nonnegative score that denotes the percentage of mutants that mutate a line in a bug’s ground truth
- mutable is a Boolean that identifies the bugs with a positive mutability score

Since mutability/mutable likely affect both category and einspect,
it makes sense to add this predictor,
so as to close the possible backdoor path \(\textrm{category} \leftarrow \textrm{mutable} \rightarrow \textrm{einspect}\).
We are only interested in controlling for bug kind for einspect;
thus, we switch to a univariate model where einspect is the only outcome variable.
```r
eq.m4.einspect <- brmsformula(einspectS ~ 1
                              + (1|p|family) + (1|q|category)
                              + predicate*family
                              + crashing*family
                              + ismutable*family,
                              family=brmsfamily("gaussian", link="log"))
eq.m4 <- eq.m4.einspect
pp4.check <- get_prior(eq.m4, data=by.statement)
pp4 <- c(
  set_prior("normal(0, 1.0)", class="Intercept"),
  set_prior("normal(0, 1.0)", class="b"),
  set_prior("weibull(2, 0.3)", class="sd", coef="Intercept",
            group="family"),
  set_prior("weibull(2, 0.3)", class="sd", coef="Intercept",
            group="category"),
  set_prior("gamma(0.01, 0.01)", class="sigma")
)
```
Prior checks:
We fit model \(m_4\).
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 4 finished in 25.8 seconds.
## Chain 2 finished in 26.9 seconds.
## Chain 3 finished in 28.1 seconds.
## Chain 1 finished in 34.8 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 28.9 seconds.
## Total execution time: 34.9 seconds.
Diagnostics:
## [1] 0
## [1] 1.003327
## [1] 0.3383162
Posterior checks:
Since \(m_4\) uses less data than the previous models
(it doesn’t consider the outcome time), we cannot compare it to
the other models using LOO (or any other information criterion, for that matter).
## Family: gaussian
## Links: mu = log; sigma = identity
## Formula: einspectS ~ 1 + (1 | p | family) + (1 | q | category) + predicate * family + crashing * family + ismutable * family
## Data: by.statement (Number of observations: 945)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Group-Level Effects:
## ~category (Number of levels: 4)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.70 0.12 0.48 0.95 1.00 4044 3090
##
## ~family (Number of levels: 4)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.46 0.20 0.10 0.85 1.00 1799 1862
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept -2.67 0.65 -3.88 -1.36 1.00 2428
## predicateTRUE -0.46 0.50 -1.45 0.50 1.00 2012
## familyPS 1.62 0.68 0.22 2.87 1.00 2388
## familyST 1.89 0.69 0.38 3.12 1.00 2314
## familySBFL -1.20 0.70 -2.57 0.17 1.00 4081
## crashingTRUE -1.15 0.52 -2.18 -0.15 1.00 2799
## ismutableTRUE -0.25 0.48 -1.16 0.69 1.00 1965
## predicateTRUE:familyPS -0.07 0.52 -1.06 0.94 1.00 2010
## predicateTRUE:familyST 0.48 0.50 -0.49 1.48 1.00 1980
## predicateTRUE:familySBFL -0.02 0.86 -1.73 1.66 1.00 5036
## familyPS:crashingTRUE 1.01 0.54 -0.02 2.06 1.00 2822
## familyST:crashingTRUE -1.82 0.65 -3.12 -0.57 1.00 3334
## familySBFL:crashingTRUE 0.03 0.87 -1.67 1.67 1.00 4750
## familyPS:ismutableTRUE 1.01 0.49 0.02 1.96 1.00 2148
## familyST:ismutableTRUE 0.87 0.49 -0.12 1.81 1.00 1914
## familySBFL:ismutableTRUE -0.29 0.81 -1.84 1.31 1.00 4679
## Tail_ESS
## Intercept 2629
## predicateTRUE 2368
## familyPS 2632
## familyST 2882
## familySBFL 2796
## crashingTRUE 2913
## ismutableTRUE 2223
## predicateTRUE:familyPS 2540
## predicateTRUE:familyST 2460
## predicateTRUE:familySBFL 2962
## familyPS:crashingTRUE 2868
## familyST:crashingTRUE 2923
## familySBFL:crashingTRUE 2746
## familyPS:ismutableTRUE 2617
## familyST:ismutableTRUE 2329
## familySBFL:ismutableTRUE 2819
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.78 0.02 0.75 0.82 1.00 6433 2648
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Let’s now perform an effects analysis
on the fitted coefficients of m4.
Specifically, we look at the (fixed) effects
of the families associated with certain categories of bugs,
for response einspect.
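The quantities printed below could be computed along the following lines (a sketch, not the actual analysis code; it assumes the fitted model object is named `m4`). Note that, since MBFL is the baseline level of factor family, its "crashing" effect is the main coefficient crashingTRUE, while the other families' effects are the corresponding interaction coefficients:

```r
# Sketch: family-specific "crashing" coefficients of m4
# (main effect for baseline family MBFL, interactions for the others).
library(posterior)  # as_draws_df

draws <- as_draws_df(m4)
crashing <- cbind(
  "crashing MBFL" = draws$b_crashingTRUE,
  "crashing PS"   = draws$`b_familyPS:crashingTRUE`,
  "crashing ST"   = draws$`b_familyST:crashingTRUE`,
  "crashing SBFL" = draws$`b_familySBFL:crashingTRUE`
)
apply(crashing, 2, mean)              # point estimates ($est)
apply(crashing, 2, quantile,          # interval endpoints ($ints)
      probs = c(0.25, 0.15, 0.05, 0.025, 0.005,   # lower, levels 0.5 .. 0.99
                0.995, 0.975, 0.95, 0.85, 0.75))  # upper, levels 0.99 .. 0.5
```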
## $ints
## crashing MBFL crashing PS crashing ST crashing SBFL
## |0.5 -1.4957225 0.64172875 -2.2535900 -0.5706570
## |0.7 -1.6966420 0.45311365 -2.5091630 -0.8849482
## |0.9 -2.0240975 0.14002575 -2.8906505 -1.4130940
## |0.95 -2.1821472 -0.01729245 -3.1170247 -1.6737832
## |0.99 -2.5282891 -0.35063686 -3.5874488 -2.1767845
## 0.99| 0.1074132 2.44609395 -0.2224062 2.1764674
## 0.95| -0.1510499 2.05582775 -0.5708796 1.6714243
## 0.9| -0.3032274 1.89343600 -0.7896973 1.4436930
## 0.7| -0.6125321 1.57059350 -1.1579595 0.9424029
## 0.5| -0.7958315 1.37090000 -1.3684550 0.6420823
##
## $est
## crashing MBFL crashing PS crashing ST crashing SBFL
## -1.15095905 1.00587635 -1.82331692 0.02942783
## $ints
## predicate MBFL predicate PS predicate ST predicate SBFL
## |0.5 -0.80539225 -0.4252112 0.14267150 -0.5879013
## |0.7 -0.97924005 -0.5941629 -0.04137106 -0.8908687
## |0.9 -1.27643350 -0.8829843 -0.30878825 -1.4396120
## |0.95 -1.44913900 -1.0634027 -0.49199142 -1.7304670
## |0.99 -1.85374910 -1.3837009 -0.76560686 -2.2351072
## 0.99| 0.81469355 1.3561932 1.89115115 2.0862540
## 0.95| 0.50288860 0.9428116 1.47530875 1.6589353
## 0.9| 0.33490690 0.7920933 1.31585700 1.3988270
## 0.7| 0.05072069 0.4752352 1.00939850 0.8711154
## 0.5| -0.12628675 0.2727390 0.81871150 0.5510417
##
## $est
## predicate MBFL predicate PS predicate ST predicate SBFL
## -0.46324267 -0.07029723 0.48350202 -0.02474266
## $ints
## ismutable MBFL ismutable PS ismutable ST ismutable SBFL
## |0.5 -0.56463550 0.67018100 0.54703600 -0.8342113
## |0.7 -0.74068020 0.50529050 0.35888690 -1.1446125
## |0.9 -1.02027550 0.20357500 0.05512335 -1.6622670
## |0.95 -1.16404750 0.01718417 -0.12405297 -1.8383202
## |0.99 -1.46524645 -0.30899653 -0.47210512 -2.3507601
## 0.99| 1.03952155 2.31925060 2.11852570 1.6988942
## 0.95| 0.69069932 1.95853000 1.81475750 1.3090160
## 0.9| 0.52912870 1.82134500 1.66444000 1.0380830
## 0.7| 0.24578715 1.51742650 1.36737800 0.5435591
## 0.5| 0.07147647 1.33092500 1.19420250 0.2685713
##
## $est
## ismutable MBFL ismutable PS ismutable ST ismutable SBFL
## -0.2487351 1.0068292 0.8698194 -0.2899824
So, crashing bugs are indeed easier for ST. In contrast, predicate-related bugs do not seem to be simpler for PS.
For mutable bugs, we don't find any consistent
association. Thus, let's try to give the model
a finer-grained dependency on mutability,
rather than just the boolean indicator ismutable.
A simple way is to introduce an interaction mutability\(\times\)family.
eq.m5.einspect <- brmsformula(einspectS ~ 1
+ (1|p|family) + (1|q|category)
+ predicate*family
+ crashing*family
+ mutability*family,
family=brmsfamily("gaussian", link="log"))
eq.m5 <- eq.m5.einspect
pp5.check <- get_prior(eq.m5, data=by.statement)
pp5 <- c(
set_prior("normal(0, 1.0)", class="Intercept"),
set_prior("normal(0, 1.0)", class="b"),
set_prior("weibull(2, 0.3)", class="sd", coef="Intercept",
group="family"),
set_prior("weibull(2, 0.3)", class="sd", coef="Intercept",
group="category"),
set_prior("gamma(0.01, 0.01)", class="sigma")
)
We could get passable (not great) prior checks, but let’s cut to the chase and fit model \(m_5\).
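The fitting call itself is not echoed in the document; it would look roughly like the following (a sketch: argument values such as cores and seed are illustrative assumptions):

```r
# Sketch of the call that fits m5 (arguments are illustrative).
m5 <- brm(eq.m5,
          data = by.statement,
          prior = pp5,
          chains = 4, cores = 4,
          iter = 2000, warmup = 1000,
          backend = "cmdstanr",  # consistent with the "Running MCMC" output style
          seed = 2021)
```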
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 1 finished in 3.6 seconds.
## Chain 3 finished in 3.5 seconds.
## Chain 4 finished in 96.5 seconds.
## Chain 2 finished in 114.8 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 54.6 seconds.
## Total execution time: 115.2 seconds.
## Warning: 438 of 4000 (11.0%) transitions hit the maximum treedepth limit of 10.
## See https://mc-stan.org/misc/warnings for details.
## Warning: 2 of 4 chains have a NaN E-BFMI.
## See https://mc-stan.org/misc/warnings for details.
The first thing we notice is that two of the four chains terminated suspiciously quickly, whereas the other two went awry and spun for much longer. In addition, we got a number of scary warnings. This points to some region of the posterior that could not be sampled effectively.
Let’s see the diagnostics:
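The three numbers printed below are consistent with the count of divergent transitions, the worst (largest) \(\widehat{R}\), and the worst (smallest) effective-sample-size ratio; a sketch of how they could be computed, assuming the fitted object is `m5`:

```r
# Sketch: basic convergence diagnostics for m5.
np <- nuts_params(m5)                              # per-draw sampler diagnostics
sum(subset(np, Parameter == "divergent__")$Value)  # number of divergences
max(rhat(m5), na.rm = TRUE)                        # worst split-Rhat
min(neff_ratio(m5), na.rm = TRUE)                  # worst ESS ratio
```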
## [1] 0
## [1] 5.472564
## [1] 0.001008065
A disaster. Let’s also plot the trace plots.
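For instance, with bayesplot (a sketch; the choice of parameters to display is illustrative):

```r
# Sketch: trace plots of m5's population-level coefficients.
library(bayesplot)
mcmc_trace(as.array(m5), regex_pars = "^b_")
```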
Two chains are straight lines, and hence did not mix at all with the others!
Notice that the distribution of mutability is very skewed,
which explains the difficulties in fitting \(m_5\).
The most straightforward way out of this ditch
is to simply log-transform mutability (after adding 1 to all percentages
so that all logs are defined).
by.statement$logmutability <- log(1 + by.statement$mutability)
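To visualize the effect of the transformation (a sketch using base graphics):

```r
# Sketch: mutability is heavily right-skewed; log(1 + x) compresses the tail.
par(mfrow = c(1, 2))
hist(by.statement$mutability, main = "mutability", xlab = "")
hist(log1p(by.statement$mutability), main = "log(1 + mutability)", xlab = "")
```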
eq.m6.einspect <- brmsformula(einspectS ~ 1
+ (1|p|family) + (1|q|category)
+ predicate*family
+ crashing*family
+ logmutability*family,
family=brmsfamily("gaussian", link="log"))
eq.m6 <- eq.m6.einspect
pp6.check <- get_prior(eq.m6, data=by.statement)
pp6 <- c(
set_prior("normal(0, 1.0)", class="Intercept"),
set_prior("normal(0, 1.0)", class="b"),
set_prior("weibull(2, 0.3)", class="sd", coef="Intercept",
group="family"),
set_prior("weibull(2, 0.3)", class="sd", coef="Intercept",
group="category"),
set_prior("gamma(0.01, 0.01)", class="sigma")
)
Alternative ways to modify \(m_5\) so that it can be analyzed (which we mention but don't explore further here):

- Introducing a multi-level term, with einspect ~ log(x)*family
and log(x) = log(y) + a, where \(x/y = \textrm{mutability}\).
This is based on rewriting \(\log(a/b) = \alpha\)
into \(\log(a) = \alpha + \log(b)\).
- The approach followed in this paper.
Prior checks:
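A prior predictive check can be obtained by sampling from the prior alone (a sketch; the number of displayed draws is illustrative):

```r
# Sketch: prior predictive check for m6, ignoring the likelihood.
m6.prior <- brm(eq.m6, data = by.statement, prior = pp6,
                sample_prior = "only", seed = 2021)
pp_check(m6.prior, ndraws = 50)
```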
We fit model \(m_6\).
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 3 finished in 34.1 seconds.
## Chain 2 finished in 35.4 seconds.
## Chain 4 finished in 35.9 seconds.
## Chain 1 finished in 37.2 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 35.6 seconds.
## Total execution time: 37.3 seconds.
Diagnostics:
## [1] 0
## [1] 1.003259
## [1] 0.4447736
Posterior checks:
Everything is A-OK now.
Let’s compare the models \(m_4\) and \(m_6\) using LOO.
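The comparison shown below can be produced with brms's interface to the loo package (a sketch):

```r
# Sketch: PSIS-LOO estimates for each model, then their comparison.
loo.m4 <- loo(m4)
loo.m6 <- loo(m6)
loo_compare(loo.m4, loo.m6)
```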
## Output of model 'm4':
##
## Computed from 4000 by 945 log-likelihood matrix
##
## Estimate SE
## elpd_loo -1143.7 57.6
## p_loo 62.6 9.7
## looic 2287.5 115.2
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 938 99.3% 286
## (0.5, 0.7] (ok) 6 0.6% 155
## (0.7, 1] (bad) 1 0.1% 45
## (1, Inf) (very bad) 0 0.0% <NA>
## See help('pareto-k-diagnostic') for details.
##
## Output of model 'm6':
##
## Computed from 4000 by 945 log-likelihood matrix
##
## Estimate SE
## elpd_loo -1155.0 62.4
## p_loo 52.5 8.7
## looic 2310.0 124.8
## ------
## Monte Carlo SE of elpd_loo is 0.2.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 940 99.5% 404
## (0.5, 0.7] (ok) 5 0.5% 118
## (0.7, 1] (bad) 0 0.0% <NA>
## (1, Inf) (very bad) 0 0.0% <NA>
##
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.
##
## Model comparisons:
## elpd_diff se_diff
## m4 0.0 0.0
## m6 -11.3 18.3
\(m_6\) and \(m_4\) are very close in terms of predictive capabilities.
## Family: gaussian
## Links: mu = log; sigma = identity
## Formula: einspectS ~ 1 + (1 | p | family) + (1 | q | category) + predicate * family + crashing * family + logmutability * family
## Data: by.statement (Number of observations: 945)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Group-Level Effects:
## ~category (Number of levels: 4)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.71 0.12 0.49 0.97 1.00 3682 3030
##
## ~family (Number of levels: 4)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.48 0.18 0.14 0.84 1.00 2042 2078
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept -2.39 0.63 -3.63 -1.16 1.00 2389
## predicateTRUE -0.26 0.49 -1.22 0.69 1.00 2375
## familyPS 1.60 0.67 0.17 2.80 1.00 2430
## familyST 1.90 0.66 0.48 3.12 1.00 2150
## familySBFL -1.28 0.71 -2.67 0.12 1.00 3825
## crashingTRUE -1.19 0.53 -2.25 -0.19 1.00 2228
## logmutability -0.53 0.35 -1.28 0.11 1.00 1944
## predicateTRUE:familyPS 0.06 0.50 -0.94 1.03 1.00 2371
## predicateTRUE:familyST 0.58 0.50 -0.38 1.55 1.00 2406
## predicateTRUE:familySBFL -0.08 0.85 -1.78 1.54 1.00 4831
## familyPS:crashingTRUE 1.06 0.54 0.05 2.15 1.00 2270
## familyST:crashingTRUE -1.79 0.68 -3.16 -0.49 1.00 3195
## familySBFL:crashingTRUE 0.01 0.84 -1.68 1.61 1.00 4947
## familyPS:logmutability 0.63 0.35 -0.00 1.36 1.00 1947
## familyST:logmutability 0.52 0.35 -0.13 1.26 1.00 1954
## familySBFL:logmutability -0.11 0.63 -1.46 0.99 1.00 3366
## Tail_ESS
## Intercept 2811
## predicateTRUE 2740
## familyPS 2606
## familyST 2775
## familySBFL 2990
## crashingTRUE 2284
## logmutability 2136
## predicateTRUE:familyPS 2542
## predicateTRUE:familyST 2562
## predicateTRUE:familySBFL 3103
## familyPS:crashingTRUE 2604
## familyST:crashingTRUE 2819
## familySBFL:crashingTRUE 3009
## familyPS:logmutability 2147
## familyST:logmutability 2130
## familySBFL:logmutability 2465
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.80 0.02 0.76 0.83 1.00 4795 3099
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
## $ints
## logmutability MBFL logmutability PS logmutability ST logmutability SBFL
## |0.5 -0.75474450 0.375940500 0.266247750 -0.5156230
## |0.7 -0.88416690 0.254462550 0.147502000 -0.7639471
## |0.8 -0.97204130 0.182238400 0.078897560 -0.9411668
## |0.87 -1.05968925 0.115193315 0.008621432 -1.1116126
## |0.9 -1.12050650 0.085546755 -0.027511595 -1.2257940
## |0.95 -1.28001225 -0.004312041 -0.130285675 -1.4553477
## |0.99 -1.47134080 -0.178970675 -0.261238665 -1.8345544
## 0.99| 0.25625661 1.564178850 1.469844800 1.3301171
## 0.95| 0.10875320 1.362056500 1.258613250 0.9929937
## 0.9| 0.01516966 1.221132000 1.107307500 0.8401534
## 0.87| -0.02294809 1.163467750 1.050067700 0.7640344
## 0.8| -0.09323496 1.068632000 0.955611000 0.6521943
## 0.7| -0.16356845 0.982931800 0.874557150 0.5276751
## 0.5| -0.27993625 0.855096250 0.749353000 0.3318090
##
## $est
## logmutability MBFL logmutability PS logmutability ST logmutability SBFL
## -0.5291964 0.6255506 0.5170813 -0.1092753
## [1] TRUE
There is a weak tendency for MBFL to do better on mutable bugs, but it can only be detected with 87% confidence (which is still decent). Incidentally, PS (and, to a lesser degree, ST) tends to perform worse on the same kinds of bugs, whereas SBFL is agnostic.
Finally, let's also collect the varying-intercept
estimates and intervals for the group-level terms
for family and category. In \(m_6\) these now correspond
to the effects on bugs that are in none of the special categories
(crashing, predicate, mutable); since this is a relatively small set,
we don't expect any very strong tendency (simply because the data is limited).
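The varying intercepts can be extracted from the posterior draws (a sketch; brms names them r_<group>[<level>,Intercept], but the summarization here is illustrative):

```r
# Sketch: varying intercepts of m6 for grouping factors family and category.
draws <- as_draws_df(m6)
family.ints   <- draws[, grep("^r_family\\[",   names(draws)), drop = FALSE]
category.ints <- draws[, grep("^r_category\\[", names(draws)), drop = FALSE]
sapply(family.ints,   quantile, probs = c(0.05, 0.5, 0.95))
sapply(category.ints, quantile, probs = c(0.05, 0.5, 0.95))
```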
## $ints
## MBFL PS ST SBFL
## |0.5 -1.09837750 -0.03866348 0.008106222 -1.03776500
## |0.7 -1.31961950 -0.16916845 -0.123050900 -1.32181800
## |0.9 -1.79294900 -0.45615560 -0.354203300 -1.84372150
## |0.95 -2.05672725 -0.59820110 -0.500310250 -2.13864925
## |0.99 -2.59430055 -0.91827276 -0.792044045 -2.81509720
## 0.99| 0.31546326 1.81253300 1.835731450 0.50949495
## 0.95| 0.12512723 1.29965575 1.372258250 0.24451078
## 0.9| 0.03854548 1.09694800 1.147540000 0.13960190
## 0.7| -0.16430215 0.72556065 0.774333900 -0.05992452
## 0.5| -0.31820900 0.52408350 0.571569000 -0.19567100
##
## $est
## MBFL PS ST SBFL
## -0.7456581 0.2577865 0.3136117 -0.6669348
## $ints
## CL DEV DS WEB
## |0.5 0.60921450 -1.9213750 0.2900920 -1.8046375
## |0.7 0.47578020 -2.1417310 0.1576821 -2.0088575
## |0.9 0.21938015 -2.5400950 -0.1101333 -2.3996800
## |0.95 0.02653933 -2.7496218 -0.2652119 -2.6116985
## |0.99 -0.29597590 -3.1878640 -0.6054692 -2.9940812
## 0.99| 1.81451660 -0.4967978 1.4707381 -0.2732759
## 0.95| 1.56997425 -0.7112183 1.2597173 -0.5808576
## 0.9| 1.45210450 -0.8386855 1.1383780 -0.7016571
## 0.7| 1.23746050 -1.1045085 0.9314273 -0.9619804
## 0.5| 1.11337000 -1.2482250 0.7968602 -1.1282750
##
## $est
## CL DEV DS WEB
## 0.8508594 -1.6101808 0.5346215 -1.4848656
Let’s prepare and print some plots of the overall results for model \(m_6\).
To read the plots more clearly, let's also save the various interval endpoints converted to the outcome scale, both in standardized units and in absolute units.
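Since \(m_6\) uses a log link, moving an endpoint \(\beta\) to the outcome scale amounts to exponentiating it: a coefficient \(\beta\) multiplies the expected (standardized) outcome by \(e^{\beta}\). A sketch:

```r
# Sketch: a link-scale coefficient beta corresponds to a multiplicative
# factor exp(beta) on the (standardized) outcome scale.
exp(-0.53)  # e.g., the logmutability estimate: a factor of about 0.59
# Absolute units would additionally undo the standardization of einspectS
# (i.e., multiply back by the scaling constant used to define it).
```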
## [1] TRUE
## [1] TRUE
## [1] TRUE
## Saving 7 x 5 in image
## [1] "paper/m2-family.pdf"
## Saving 7 x 5 in image
## [1] "paper/m2-category.pdf"
## Saving 7 x 5 in image
## [1] "paper/m6-crashing.pdf"
## Saving 7 x 5 in image
## [1] "paper/m6-predicate.pdf"
## Saving 7 x 5 in image
## [1] "paper/m6-mutable.pdf"